core point
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Afghanistan > Parwan Province > Charikar (0.04)
Oversampling and Downsampling with Core-Boundary Awareness: A Data Quality-Driven Approach
Belhaouari, Samir Brahim, Kahalan, Yunis Carreon, Shaffique, Humaira, Belhaouari, Ismael, Islam, Ashhadul
The effectiveness of machine learning models, particularly in unbalanced classification tasks, is often hindered by the failure to differentiate between critical instances near the decision boundary and redundant samples concentrated in the core of the data distribution. In this paper, we propose a method to systematically identify and differentiate between these two types of data. Through extensive experiments on multiple benchmark datasets, we show that the boundary data oversampling method improves the F1 score by up to 10\% on 96\% of the datasets, whereas our core-aware reduction method compresses datasets up to 90\% while preserving their accuracy, making it 10 times more powerful than the original dataset. Beyond imbalanced classification, our method has broader implications for efficient model training, particularly in computationally expensive domains such as Large Language Model (LLM) training. By prioritizing high-quality, decision-relevant data, our approach can be extended to text, multimodal, and self-supervised learning scenarios, offering a pathway to faster convergence, improved generalization, and significant computational savings. This work paves the way for future research in data-efficient learning, where intelligent sampling replaces brute-force expansion, driving the next generation of AI advancements. Our code is available as a Python package at https://pypi.org/project/adaptive-resampling/ .
- Europe > Netherlands > Limburg > Maastricht (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- (2 more...)
Scalable Varied-Density Clustering via Graph Propagation
Pham, Ninh, Zheng, Yingtao, Phibbs, Hugo
We propose a novel perspective on varied-density clustering for high-dimensional data by framing it as a label propagation process in neighborhood graphs that adapt to local density variations. Our method formally connects density-based clustering with graph connectivity, enabling the use of efficient graph propagation techniques developed in network science. To ensure scalability, we introduce a density-aware neighborhood propagation algorithm and leverage advanced random projection methods to construct approximate neighborhood graphs. Our approach significantly reduces computational cost while preserving clustering quality. Empirically, it scales to datasets with millions of points in minutes and achieves competitive accuracy compared to existing baselines.
- Europe > Netherlands > South Holland > Leiden (0.09)
- Oceania > New Zealand > North Island > Auckland Region > Auckland (0.05)
- Asia > China (0.04)
Unimodal Strategies in Density-Based Clustering
Nir, Oron, Tenenbaum, Jay, Shamir, Ariel
Density-based clustering methods often surpass centroid-based counterparts, when addressing data with noise or arbitrary data distributions common in real-world problems. In this study, we reveal a key property intrinsic to density-based clustering methods regarding the relation between the number of clusters and the neighborhood radius of core points - we empirically show that it is nearly unimodal, and support this claim theoretically in a specific setting. We leverage this property to devise new strategies for finding appropriate values for the radius more efficiently based on the Ternary Search algorithm. This is especially important for large scale data that is high-dimensional, where parameter tuning is computationally intensive. We validate our methodology through extensive applications across a range of high-dimensional, large-scale NLP, Audio, and Computer Vision tasks, demonstrating its practical effectiveness and robustness. This work not only offers a significant advancement in parameter control for density-based clustering but also broadens the understanding regarding the relations between their guiding parameters. Our code is available at https://github.com/oronnir/UnimodalStrategies.
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Middle East > Israel (0.04)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- (11 more...)
Dynamic DBSCAN with Euler Tour Sequences
Shin, Seiyun, Shomorony, Ilan, Macgregor, Peter
We propose a fast and dynamic algorithm for Density-Based Spatial Clustering of Applications with Noise (DBSCAN) that efficiently supports online updates. Traditional DBSCAN algorithms, designed for batch processing, become computationally expensive when applied to dynamic datasets, particularly in large-scale applications where data continuously evolves. To address this challenge, our algorithm leverages the Euler Tour Trees data structure, enabling dynamic clustering updates without the need to reprocess the entire dataset. This approach preserves a near-optimal accuracy in density estimation, as achieved by the state-of-the-art static DBSCAN method (Esfandiari et al., 2021) Our method achieves an improved time complexity of $O(d \log^3(n) + \log^4(n))$ for every data point insertion and deletion, where $n$ and $d$ denote the total number of updates and the data dimension, respectively. Empirical studies also demonstrate significant speedups over conventional DBSCANs in real-time clustering of dynamic datasets, while maintaining comparable or superior clustering quality.
- Europe > Austria > Vienna (0.14)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Asia > Thailand (0.04)
Rethinking Deep Clustering Paradigms: Self-Supervision Is All You Need
Shaheena, Amal, Mrabahb, Nairouz, Ksantinia, Riadh, Alqaddoumia, Abdulla
The recent advances in deep clustering have been made possible by significant progress in self-supervised and pseudo-supervised learning. However, the trade-off between self-supervision and pseudo-supervision can give rise to three primary issues. The joint training causes Feature Randomness and Feature Drift, whereas the independent training causes Feature Randomness and Feature Twist. In essence, using pseudo-labels generates random and unreliable features. The combination of pseudo-supervision and self-supervision drifts the reliable clustering-oriented features. Moreover, moving from self-supervision to pseudo-supervision can twist the curved latent manifolds. This paper addresses the limitations of existing deep clustering paradigms concerning Feature Randomness, Feature Drift, and Feature Twist. We propose a new paradigm with a new strategy that replaces pseudo-supervision with a second round of self-supervision training. The new strategy makes the transition between instance-level self-supervision and neighborhood-level self-supervision smoother and less abrupt. Moreover, it prevents the drifting effect that is caused by the strong competition between instance-level self-supervision and clustering-level pseudo-supervision. Moreover, the absence of the pseudo-supervision prevents the risk of generating random features. With this novel approach, our paper introduces a Rethinking of the Deep Clustering Paradigms, denoted by R-DC. Our model is specifically designed to address three primary challenges encountered in Deep Clustering: Feature Randomness, Feature Drift, and Feature Twist. Experimental results conducted on six datasets have shown that the two-level self-supervision training yields substantial improvements.
- North America > Canada > Ontario > Toronto (0.14)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Asia > Middle East > Bahrain (0.04)
DeepCore: Simple Fingerprint Construction for Differentiating Homologous and Piracy Models
Sun, Haifeng, Zhang, Lan, Li, Xiang-Yang
As intellectual property rights, the copyright protection of deep models is becoming increasingly important. Existing work has made many attempts at model watermarking and fingerprinting, but they have ignored homologous models trained with similar structures or training datasets. We highlight challenges in efficiently querying black-box piracy models to protect model copyrights without misidentifying homologous models. To address these challenges, we propose a novel method called DeepCore, which discovers that the classification confidence of the model is positively correlated with the distance of the predicted sample from the model decision boundary and piracy models behave more similarly at high-confidence classified sample points. Then DeepCore constructs core points far away from the decision boundary by optimizing the predicted confidence of a few sample points and leverages behavioral discrepancies between piracy and homologous models to identify piracy models. Finally, we design different model identification methods, including two similarity-based methods and a clustering-based method to identify piracy models using models' predictions of core points. Extensive experiments show the effectiveness of DeepCore in identifying various piracy models, achieving lower missed and false identification rates, and outperforming state-of-the-art methods.
- North America > United States > California > Alameda County > Oakland (0.04)
- Europe > Greece (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Anhui Province > Hefei (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
Bi-Filtration and Stability of TDA Mapper for Point Cloud Data
Carlsson, Singh and Memoli's TDA mapper takes a point cloud dataset and outputs a graph that depends on several parameter choices. Dey, Memoli, and Wang developed Multiscale Mapper for abstract topological spaces so that parameter choices can be analyzed via persistent homology. However, when applied to actual data, one does not always obtain filtrations of mapper graphs. DBSCAN, one of the most common clustering algorithms used in the TDA mapper software, has two parameters, \textbf{$\epsilon$} and \textbf{MinPts}. If \textbf{MinPts = 1} then DBSCAN is equivalent to single linkage clustering with cutting height \textbf{$\epsilon$}. We show that if DBSCAN clustering is used with \textbf{MinPts $>$ 2}, a filtration of mapper graphs may not exist except in the absence of free-border points; but such filtrations exist if DBSCAN clustering is used with \textbf{MinPts = 1} or \textbf{2} as the cover size increases, \textbf{$\epsilon$} increases, and/or \textbf{MinPts} decreases. However, the 1-dimensional filtration is unstable. If one adds noise to a data set so that each data point has been perturbed by a distance at most \textbf{$\delta$}, the persistent homology of the mapper graph of the perturbed data set can be significantly different from that of the original data set. We show that we can obtain stability by increasing both the cover size and \textbf{$\epsilon$} at the same time. In particular, we show that the bi-filtrations of the homology groups with respect to cover size and $\epsilon$ between these two datasets are \textbf{2$\delta$}-interleaved.
- North America > United States > Wisconsin > La Crosse County > La Crosse (0.04)
- North America > United States > New Jersey (0.04)
- North America > United States > Iowa (0.04)
- (2 more...)
What's the Opposite of a Face? Finding Shared Decodable Concepts and their Negations in the Brain
Efird, Cory, Murphy, Alex, Zylberberg, Joel, Fyshe, Alona
Prior work has offered evidence for functional localization in the brain; different anatomical regions preferentially activate for certain types of visual input. For example, the fusiform face area preferentially activates for visual stimuli that include a face. However, the spectrum of visual semantics is extensive, and only a few semantically-tuned patches of cortex have so far been identified in the human brain. Using a multimodal (natural language and image) neural network architecture (CLIP) we train a highly accurate contrastive model that maps brain responses during naturalistic image viewing to CLIP embeddings. We then use a novel adaptation of the DBSCAN clustering algorithm to cluster the parameters of these participant-specific contrastive models. This reveals what we call Shared Decodable Concepts (SDCs): clusters in CLIP space that are decodable from common sets of voxels across multiple participants. Examining the images most and least associated with each SDC cluster gives us additional insight into the semantic properties of each SDC. We note SDCs for previously reported visual features (e.g. orientation tuning in early visual cortex) as well as visual semantic concepts such as faces, places and bodies. In cases where our method finds multiple clusters for a visuo-semantic concept, the least associated images allow us to dissociate between confounding factors. For example, we discovered two clusters of food images, one driven by color, the other by shape. We also uncover previously unreported areas such as regions of extrastriate body area (EBA) tuned for legs/hands and sensitivity to numerosity in right intraparietal sulcus, and more. Thus, our contrastive-learning methodology better characterizes new and existing visuo-semantic representations in the brain by leveraging multimodal neural network representations and a novel adaptation of clustering algorithms.